Introduction to R and RStudio

In this workshop, we will analyze an RNAseq dataset. To do this, we’ll need two things: data and a platform to analyze the data.

You already downloaded the data. But what platform will we use to analyze the data? We have many options!

We could try to use a spreadsheet program like Microsoft Excel or Google Sheets, but these programs have limited access and less flexibility, and they don’t easily allow for things that are critical to “reproducible” research, such as sharing the steps used to explore and change the original data.

Instead, we’ll use a programming language to test our hypothesis. Today we will use R, but we could have also used Python for the same reasons we chose R (and we teach workshops for both languages). Both R and Python are freely available, the instructions you use to do the analysis are easily shared, and by using reproducible practices, it’s straightforward to add more data or to change settings like colors or the size of a plotting symbol.

To run R, all you really need is the R program, which is available for computers running the Windows, Mac OS X, or Linux operating systems. You downloaded R while getting set up for this workshop.

To make your life in R easier, there is a great (and free!) program called RStudio that you also downloaded and used during set up. As we work today, we’ll use features that are available in RStudio for writing and running code, managing projects, installing packages, getting help, and much more. It is important to remember that R and RStudio are different, but complementary programs. You need R to use RStudio.

To get started, we’ll spend a little time getting familiar with the RStudio environment and setting it up to suit your tastes. When you start RStudio, you’ll have three panels.

[Figure: the RStudio window as it first appears, with three panels]

On the left you’ll have a panel with three tabs - Console, Terminal, and Jobs. The Console tab is what running R from the command line looks like. This is where you can enter R code. Try typing in 2+2 at the prompt (>). In the upper right panel are tabs indicating the Environment, History, and a few other things. If you click on the History tab, you’ll see the command you ran at the R prompt.

In the lower right panel are tabs for Files, Plots, Packages, Help, and Viewer. You used the Packages tab to install tidyverse.

We’ll spend more time in each of these tabs as we go through the workshop, so we won’t spend a lot of time discussing them now.

You might want to alter the appearance of your RStudio window. The default appearance has a white background with black text. If you go to the Tools menu at the top of your screen, you’ll see a “Global options” menu at the bottom of the drop down; select that.

From there you will see the ability to alter numerous things about RStudio. Under the Appearance tab you can select the theme you like most. As you can see, there’s a lot in Global options that you can set to improve your experience in RStudio. Most of these settings are a matter of personal preference.

However, you can also update settings to help ensure the reproducibility of your code. In the General tab, none of the checkboxes in the R Sessions, Workspace, and History sections should be selected. In addition, the option next to “Save workspace to .RData on exit” should be set to Never. These settings will help ensure that things you worked on previously don’t carry over between sessions.

Let’s get going on our analysis!

One of the helpful features in RStudio is the ability to create a project. A project is a special directory that contains all of the code and data that you will need to run an analysis.

At the top of your screen you’ll see the “File” menu. Select that menu and then the menu for “New Project…”.

When the smaller window opens, select “Existing Directory” and then the “Browse” button in the next window.

Navigate to the directory that contains your code and data from the setup instructions and click the “Open” button.

Then click the “Create Project” button.

Did you notice anything change?

In the lower right corner of your RStudio session, you should notice that your Files tab is now your project directory. You’ll also see a file ending in .Rproj in that directory; RStudio names it after the project directory.

From now on, you should start RStudio by double clicking on that file. This will make sure you are in the correct directory when you run your analysis.

We’d like to create a file where we can keep track of our R code.

Back in the “File” menu, you’ll see the first option is “New File”. Selecting “New File” opens another menu to the right and the first option is “R Script”. Select “R Script”.

Now we have a fourth panel in the upper left corner of RStudio that includes an Editor tab with an untitled R Script. Let’s save this file as rnaseq_analysis.R in our project directory.

We will be entering R code into the Editor tab to run in our Console panel.

On line 1 of rnaseq_analysis.R, type 2+2.

With your cursor on the line with the 2+2, click the button that says Run. You should be able to see that 2+2 was run in the Console.

As you write more code, you can highlight multiple lines and then click Run to run all of the lines you have selected.

Packages

Let’s load our first package, the tidyverse, with library(tidyverse).

Go ahead and run that line in the Console by clicking the Run button on the top right of the Editor tab and choosing Run Selected Lines. This loads a set of useful functions and sample data that makes it easier for us to do complex analyses and create professional visualizations in R.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

What’s with all those messages???

When you loaded the tidyverse package, you probably got a message like the one we got above. Don’t panic! These messages are just giving you more information about what happened when you loaded tidyverse. The tidyverse is actually a collection of several different packages, so the first section of the message tells us which packages were attached when we loaded tidyverse (these include ggplot2, dplyr, and tidyr, which we will use a lot!).

The second section of messages gives a list of “conflicts.” Sometimes, the same function name will be used in two different packages, and R has to decide which function to use. For example, our message says that:

dplyr::filter() masks stats::filter()

This means that two different packages (dplyr from tidyverse and stats from base R) have a function named filter(). By default, R uses the function from the package that was loaded most recently, so if we use the filter() function after loading tidyverse, we will be using the filter() function from dplyr.
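If you ever need the masked version of a function, you can name the package explicitly with package::function(). The sketch below is only an illustration: it uses base R alone so that it runs in any session, and the object names moving_sum and filter_source are made up for this example.

```r
# Ask for a specific package's version of a function with package::function().
# stats::filter() is the base R filter; here it computes a 3-point moving sum.
moving_sum <- stats::filter(1:5, rep(1, 3))
print(moving_sum)

# Report which package the plain name filter() currently points to.
# In a fresh session this is "stats"; after library(tidyverse) it is "dplyr".
filter_source <- environmentName(environment(filter))
print(filter_source)
```

Writing dplyr::filter() or stats::filter() in your own code removes any doubt about which function is being called, regardless of the order in which packages were loaded.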

The tidyverse vs Base R

If you’ve used R before, you may have learned commands that are different than the ones we will be using during this workshop. We will be focusing on functions from the tidyverse. The “tidyverse” is a collection of R packages that have been designed to work well together and offer many convenient features that do not come with a fresh install of R (aka “base R”). These packages are very popular and have a lot of developer support including many staff members from RStudio. These functions generally help you to write code that is easier to read and maintain. We believe learning these tools will help you become more productive more quickly.


Variables and Types

  • character
  • logical
  • numeric

name <- "Kelly"
favorite_color <- "green"
height_inches <- 64
likes_cats <- TRUE

You can think of variables as labelled boxes where you store your belongings.

If you’re not sure of the type of a variable, you can find out with the class() function.

class(name)
## [1] "character"
class(favorite_color)
## [1] "character"
class(height_inches)
## [1] "numeric"
class(likes_cats)
## [1] "logical"

Assigning values to objects

Try to assign values to some objects and observe each object after you have assigned a new value. What do you notice?

name <- "Ben"
name
height <- 72
height
name <- "Jerry"
name

Solution

When we assign a value to an object, the object stores that value so we can access it later. However, if we store a new value in an object we have already created, it replaces the old value. The height object does not change, because we never assign it a new value.

Guidelines on naming objects

  • You want your object names to be explicit and not too long.
  • They cannot start with a number (2x is not valid, but x2 is).
  • R is case sensitive, so for example, weight_kg is different from Weight_kg.
  • You cannot use spaces in the name.
  • There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for; see ?Reserved for a complete list). If in doubt, check the help to see if the name is already in use (?function_name).
  • It’s best to avoid dots (.) within names. Many function names in R itself have them and dots also have a special meaning (methods) in R and other programming languages.
  • It is recommended to use nouns for object names and verbs for function names.
  • Be consistent in the styling of your code, such as where you put spaces, how you name objects, etc. Using a consistent coding style makes your code clearer to read for your future self and your collaborators. One popular style guide can be found through the tidyverse.
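A few of the naming rules above can be checked directly in the Console. This is a small sketch with made-up values; starts_with_digit_ok is a hypothetical name used only here.

```r
# R is case sensitive: these are two different objects.
weight_kg <- 70
Weight_kg <- 82
weight_kg == Weight_kg   # FALSE, because the two objects hold different values

# Names may contain digits, but they cannot start with one.
x2 <- 5                  # a valid name

# parse() checks whether text is valid R code without running it.
# "2x <- 5" is not valid R, so we catch the error and record FALSE.
starts_with_digit_ok <- tryCatch({
  parse(text = "2x <- 5")
  TRUE
}, error = function(e) FALSE)
starts_with_digit_ok     # FALSE
```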

The above variables hold a single value. What if you need multiple values?

dial_numbers <- c(1, 2, 3)
dial_numbers2 <- 1:3
class(dial_numbers)
## [1] "numeric"
fan_settings <- c('low', 'medium', 'high')
class(fan_settings)
## [1] "character"
fan_settings <- factor(c('low', 'high', 'medium', 'high', 'medium'), levels = c('low', 'medium', 'high'))
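To see what these objects contain, we can inspect them with a few base R functions. The sketch below repeats the definitions above so it can run on its own; it is meant as an illustration rather than a complete tour of vectors and factors.

```r
# c() and the colon operator both build vectors, but of different types.
dial_numbers <- c(1, 2, 3)
dial_numbers2 <- 1:3
class(dial_numbers)             # "numeric"
class(dial_numbers2)            # "integer"
dial_numbers == dial_numbers2   # TRUE TRUE TRUE: the values still match

# A factor stores categories with a fixed set of levels in a chosen order.
fan_settings <- factor(c("low", "high", "medium", "high", "medium"),
                       levels = c("low", "medium", "high"))
levels(fan_settings)       # "low" "medium" "high"
as.integer(fan_settings)   # 1 3 2 3 2: the position of each value in the levels
table(fan_settings)        # counts per level, reported in level order
```

The order of the levels matters later on, because plots and summary tables list categories in level order rather than alphabetically.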

What if you want to store attributes about lots of different things, like patient samples? For that we use data.frames. Let’s find out how to read in a dataset as a data.frame.

Loading and reviewing data

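R offers several functions for reading tabular data from a file, including read.csv (base R), read_csv (from readr, part of the tidyverse), data.table::fread, and readxl::read_excel (for Excel files).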

Data objects

There are many different ways to store data in R. Most objects have a table-like structure with rows and columns. We will refer to these objects generally as “data objects”. If you’ve used R before, you may be used to calling them “data.frames”. Functions from the “tidyverse” such as read_csv work with objects called “tibbles”, which are a specialized kind of “data.frame.” Another common way to store data is a “data.table”. All of these types of data objects (tibbles, data.frames, and data.tables) can be used with the commands we will learn in this lesson to make plots. We may sometimes use these terms interchangeably.
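To get a feel for a data object before we load the workshop data, here is a minimal sketch that writes a tiny table to a temporary file and reads it back with base R’s read.csv. The table, its values, and the names example_samples, example_path, and samples are invented for illustration; they are not the workshop dataset.

```r
# Build a tiny example table (made-up values, not the workshop data).
example_samples <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  condition = c("Media", "Mtb", "Media"),
  total_seq = c(100, 250, 175)
)

# Write it to a temporary csv file, then read it back in.
example_path <- tempfile(fileext = ".csv")
write.csv(example_samples, example_path, row.names = FALSE)
samples <- read.csv(example_path)

class(samples)   # "data.frame"
dim(samples)     # 3 3 (rows, then columns)
str(samples)     # one line per column, showing its type and first values
```

Reading the same file with read_csv from the tidyverse would give a tibble instead of a plain data.frame, with the same rows and columns.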

Understanding functions

Let’s start by looking at the code RStudio ran for us by copying and pasting the first line from the console into our rnaseq_analysis.R file that is open in the Editor window.

metadata <- read_csv("0_data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You should now have a line of text in your code file that started with metadata and ends with a ) symbol.

What if we want to run this command from our code file?

In order to run code that you’ve typed in the editor, you have a few options. We can click Run again from the right side of the Editor tab, but the quickest way to run the code is by pressing Ctrl+Enter on your keyboard (Cmd+Return on Mac).

This will run the line of code that currently contains your cursor and will move your cursor to the next line. Note that when RStudio runs your code, it basically just copies your code from the Editor window to the Console window, just like what happened when we selected Run Selected Line(s).

Let’s take a closer look at the parts of this command.

Starting from the left, the first thing we see is metadata. We viewed the contents of this file after it was imported so we know that metadata acts as a placeholder for our data.

If we highlight just metadata within our code file and press Ctrl+Enter on our keyboard, what do we see?

We should see a data table outputted, similar to what we saw in the Viewer tab.

In R terms, metadata is a named object that references or stores something. In this case, metadata stores a specific table of data.

Looking back at the command in our code file, the second thing we see is a <- symbol, which is the assignment operator. It assigns values generated or typed on the right to objects on the left. An alternative symbol that you might see used as an assignment operator is =, but it is clearer to only use <- for assignment. We use this symbol so often that RStudio has a keyboard shortcut for it: Alt+- on Windows, and Option+- on Mac.
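One reason to prefer <- is that = has a second job inside function calls, where it names an argument. A short sketch with made-up object names (height_cm, height_in, average):

```r
# Both of these lines create an object at the top level of a script.
height_cm <- 180
height_in = 71

# Inside a function call, = names an argument instead of assigning.
# Here x is the name of mean()'s first argument.
average <- mean(x = c(2, 4, 6))
average   # 4
```

Keeping <- for assignment and = for arguments makes it easier to tell the two roles apart when reading code.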

The next part of the command is read_csv("0_data/metadata.csv"). This has a few different key parts. The first part is the read_csv function. You call a function in R by typing its name followed by opening and closing parentheses. Each function has a purpose, which is often hinted at by the name of the function. Let’s try to run the function without anything inside the parentheses.

read_csv()
## Error in vroom::vroom(file, delim = ",", col_names = col_names, col_types = col_types, : argument "file" is missing, with no default

We get an error message. Don’t panic! Error messages pop up all the time, and can be super helpful in debugging code.

In this case, the message tells us “argument "file" is missing, with no default.” Many functions, including read_csv, require additional pieces of information to do their job. We call these additional values “arguments” or “parameters.” You pass arguments to a function by placing values in between the parentheses. A function takes in these arguments and does a bunch of “magic” behind the scenes to output something we’re interested in.
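Error messages are ordinary text, so we can also capture and read them from code. As a sketch, the example below uses base R’s read.csv (a stand-in chosen so it runs without the tidyverse) and the made-up name error_text.

```r
# Call read.csv() with no arguments and keep the error message
# instead of stopping the script.
error_text <- tryCatch(
  read.csv(),
  error = function(e) conditionMessage(e)
)

# The message names the argument that was left out.
error_text
```

tryCatch() is not needed for everyday work at the Console; it is used here only to show that the message points at the missing argument.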

For example, when we loaded in our data, the command contained "0_data/metadata.csv" inside the read_csv() function. This is the value we assigned to the file argument. But we didn’t say that that was the file. How does that work?

Pro-tip

Each function has a help page that documents what arguments the function expects and what value it will return. You can bring up the help page a few different ways. If you have typed the function name in the Editor window, you can put your cursor on the function name and press F1 to open the help page in the Help viewer in the lower right corner of RStudio. You can also type ? followed by the function name in the console.

For example, try running ?read_csv. A help page should pop up with information about what the function is used for and how to use it, as well as useful examples of the function in action. As you can see, the first argument of read_csv is the file path.

The read_csv() function took the file path we provided, did who-knows-what behind the scenes, and then outputted an R object with the data stored in that csv file. All that, with one short line of code!
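We can confirm this “first argument” behavior from code as well as from the help page. The sketch below uses base R’s read.csv as a stand-in (an assumption, so that it runs without the tidyverse), a throwaway temporary file, and made-up object names.

```r
# formals() lists a function's arguments in order; the first one is the file.
first_argument <- names(formals(read.csv))[1]
first_argument   # "file"

# Write a tiny throwaway csv file to read back in.
example_path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:2), example_path, row.names = FALSE)

# An unnamed value is matched to the first argument by position,
# so these two calls do the same thing.
by_position <- read.csv(example_path)
by_name <- read.csv(file = example_path)
identical(by_position, by_name)   # TRUE
```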

Do all functions need arguments? Let’s test some other functions:

Sys.Date()
## [1] "2022-05-18"
getwd()
## [1] "/Users/kelly/projects/carpentries/2022_ASM_Microbe_RNAseq"

While some functions, like those above, don’t need any arguments, in other functions we may want to use multiple arguments. When we’re using multiple arguments, we separate the arguments with commas. For example, we can use the sum() function to add numbers together:

sum(5, 6)
## [1] 11

Learning more about functions

Look up the function round. What does it do? What will you get as output for the following lines of code?

round(3.1415)
round(3.1415,3)

Solution

round rounds a number. By default, it rounds to zero digits (in our example above, to 3). If you give it a second number, it rounds to that number of digits (in our example above, to 3.142).

Notice how in this example, we didn’t include any argument names. But you can use argument names if you want:

read_csv(file = '0_data/metadata.csv')
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 20 × 9
##    libID     ptID  condition age_dys sex   ptID_old RNAseq methylation total_seq
##    <chr>     <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>           <dbl>
##  1 pt01_Med… pt01  Media       12410 M     pt00001  TRUE   FALSE        9114402.
##  2 pt01_Mtb  pt01  Mtb         12410 M     pt00001  TRUE   FALSE        8918699.
##  3 pt02_Med… pt02  Media       12775 M     pt00002  TRUE   FALSE        9221555.
##  4 pt02_Mtb  pt02  Mtb         12775 M     pt00002  TRUE   FALSE        7733260.
##  5 pt03_Med… pt03  Media       11315 M     pt00003  TRUE   FALSE        6231728.
##  6 pt03_Mtb  pt03  Mtb         11315 M     pt00003  TRUE   FALSE        7105193.
##  7 pt04_Med… pt04  Media        8395 M     pt00004  TRUE   TRUE        10205557.
##  8 pt04_Mtb  pt04  Mtb          8395 M     pt00004  TRUE   TRUE         8413543.
##  9 pt05_Med… pt05  Media        7300 M     pt00005  TRUE   FALSE       15536685.
## 10 pt05_Mtb  pt05  Mtb          7300 M     pt00005  TRUE   FALSE       15509446.
## 11 pt06_Med… pt06  Media        6570 F     pt00006  TRUE   FALSE        7085995.
## 12 pt06_Mtb  pt06  Mtb          6570 F     pt00006  TRUE   FALSE        6588160.
## 13 pt07_Med… pt07  Media        7665 F     pt00007  TRUE   FALSE       10706098.
## 14 pt07_Mtb  pt07  Mtb          7665 F     pt00007  TRUE   FALSE        8576245.
## 15 pt08_Med… pt08  Media        8760 M     pt00008  TRUE   FALSE        9957906.
## 16 pt08_Mtb  pt08  Mtb          8760 M     pt00008  TRUE   FALSE        8220348.
## 17 pt09_Med… pt09  Media        6935 M     pt00009  TRUE   FALSE       13055276 
## 18 pt09_Mtb  pt09  Mtb          6935 M     pt00009  TRUE   FALSE       13800442.
## 19 pt10_Med… pt10  Media        8030 F     pt00010  TRUE   FALSE        8216706.
## 20 pt10_Mtb  pt10  Mtb          8030 F     pt00010  TRUE   FALSE        7599609.

Position of the arguments in functions

Which of the following lines of code will give you an output of 3.14? For the one(s) that don’t give you 3.14, what do they give you?

round(x = 3.1415)
round(x = 3.1415, digits = 2)
round(digits = 2, x = 3.1415)
round(2, 3.1415)

Solution

The 2nd and 3rd lines will give you the right answer because the arguments are named, and when you use names the order doesn’t matter. The 1st line will give you 3 because the default number of digits is 0. The 4th line will give you 2 because, since you didn’t name the arguments, x=2 and digits=3.1415.

Sometimes it is helpful - or even necessary - to include the argument name, but often we can skip the argument name, if the argument values are passed in a certain order. If all this function stuff sounds confusing, don’t worry! We’ll see a bunch of examples as we go that will make things clearer.
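As one more illustration, here is a sketch using the base R function seq(), which builds a sequence of numbers. The object names are made up for this example.

```r
# Positional arguments first, then a named one: from = 1, to = 10, step of 3.
mixed <- seq(1, 10, by = 3)
mixed   # 1 4 7 10

# Naming every argument gives the same result...
all_named <- seq(from = 1, to = 10, by = 3)

# ...and with names, the order no longer matters.
reordered <- seq(by = 3, to = 10, from = 1)

identical(mixed, all_named)       # TRUE
identical(all_named, reordered)   # TRUE
```

A common habit is to pass the first one or two arguments by position and name the rest, which keeps calls short but still readable.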

Reading in an Excel file

Say you have an Excel file and not a csv - how would you read that in? Hint: Use the Internet to help you figure it out!

Solution

One way is using the read_excel function in the readxl package. There are other ways, but this is our preferred method because the output will be the same as the output of read_csv.
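A minimal sketch of that approach is below. It assumes the readxl package is installed, and it reads an example workbook that ships with readxl (located with readxl_example()) rather than a workshop file, since we don’t have an Excel version of our data. The names example_path and excel_data are made up for this example.

```r
# Only run the example if readxl is installed on this computer.
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook bundled with the readxl package.
  example_path <- readxl::readxl_example("datasets.xlsx")

  # Read the first sheet of the workbook.
  excel_data <- readxl::read_excel(example_path)

  # Like read_csv, read_excel returns a tibble.
  print(class(excel_data))
}
```

For your own file, you would replace example_path with the path to your workbook, and you could use the sheet argument of read_excel to choose a sheet other than the first.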

Comments

Sometimes you may want to write notes in your code to help you remember what your code is doing, but you don’t want R to think these notes are part of the code you want to evaluate. That’s where comments come in! Anything after a # symbol in your code will be ignored by R. For example, let’s say we wanted to make a note of what each of the functions we just used does:

Sys.Date()  # outputs the current date
## [1] "2022-05-18"
getwd()     # outputs our current working directory (folder)
## [1] "/Users/kelly/projects/carpentries/2022_ASM_Microbe_RNAseq"
sum(5, 6)   # adds numbers
## [1] 11
read_csv(file = '0_data/metadata.csv') # reads in csv file
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 20 × 9
##    libID     ptID  condition age_dys sex   ptID_old RNAseq methylation total_seq
##    <chr>     <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>           <dbl>
##  1 pt01_Med… pt01  Media       12410 M     pt00001  TRUE   FALSE        9114402.
##  2 pt01_Mtb  pt01  Mtb         12410 M     pt00001  TRUE   FALSE        8918699.
##  3 pt02_Med… pt02  Media       12775 M     pt00002  TRUE   FALSE        9221555.
##  4 pt02_Mtb  pt02  Mtb         12775 M     pt00002  TRUE   FALSE        7733260.
##  5 pt03_Med… pt03  Media       11315 M     pt00003  TRUE   FALSE        6231728.
##  6 pt03_Mtb  pt03  Mtb         11315 M     pt00003  TRUE   FALSE        7105193.
##  7 pt04_Med… pt04  Media        8395 M     pt00004  TRUE   TRUE        10205557.
##  8 pt04_Mtb  pt04  Mtb          8395 M     pt00004  TRUE   TRUE         8413543.
##  9 pt05_Med… pt05  Media        7300 M     pt00005  TRUE   FALSE       15536685.
## 10 pt05_Mtb  pt05  Mtb          7300 M     pt00005  TRUE   FALSE       15509446.
## 11 pt06_Med… pt06  Media        6570 F     pt00006  TRUE   FALSE        7085995.
## 12 pt06_Mtb  pt06  Mtb          6570 F     pt00006  TRUE   FALSE        6588160.
## 13 pt07_Med… pt07  Media        7665 F     pt00007  TRUE   FALSE       10706098.
## 14 pt07_Mtb  pt07  Mtb          7665 F     pt00007  TRUE   FALSE        8576245.
## 15 pt08_Med… pt08  Media        8760 M     pt00008  TRUE   FALSE        9957906.
## 16 pt08_Mtb  pt08  Mtb          8760 M     pt00008  TRUE   FALSE        8220348.
## 17 pt09_Med… pt09  Media        6935 M     pt00009  TRUE   FALSE       13055276 
## 18 pt09_Mtb  pt09  Mtb          6935 M     pt00009  TRUE   FALSE       13800442.
## 19 pt10_Med… pt10  Media        8030 F     pt00010  TRUE   FALSE        8216706.
## 20 pt10_Mtb  pt10  Mtb          8030 F     pt00010  TRUE   FALSE        7599609.
